From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning
Authors
Abstract
We present a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes. The learning task resembles that faced by human language learners who need to discover both structure and meaning from noisy and ambiguous data across modalities. We show that our model indeed learns to predict features of the visual context given phonetically transcribed image descriptions, and show that it represents linguistic information in a hierarchy of levels: lower layers in the stack are comparatively more sensitive to form, whereas higher layers are more sensitive to meaning.
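The architecture described above can be illustrated with a minimal sketch: phoneme IDs are embedded, passed through a stack of gated recurrent (GRU) layers, and the final state of the top layer is projected to a visual feature vector. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the layer count, all dimensions, the random untrained weights, the use of the last hidden state, and the cosine-distance loss. Names such as `GRULayer` and `predict_visual_features` are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class GRULayer:
    """One gated recurrent layer (hypothetical helper, biases omitted)."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # One input and one recurrent matrix per gate:
        # z = update gate, r = reset gate, h = candidate state.
        self.W = {g: rand_matrix(hid_dim, in_dim, rng) for g in "zrh"}
        self.U = {g: rand_matrix(hid_dim, hid_dim, rng) for g in "zrh"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in
             zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in
                zip(matvec(self.W["h"], x), matvec(self.U["h"], reset_h))]
        # Interpolate between the previous state and the candidate state.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def run(self, xs):
        """Return the hidden state after every input step."""
        h = [0.0] * self.hid_dim
        states = []
        for x in xs:
            h = self.step(x, h)
            states.append(h)
        return states


def predict_visual_features(phoneme_ids, embeddings, layers, W_out):
    """Embed phonemes, run the GRU stack, project the last top-layer state."""
    xs = [embeddings[p] for p in phoneme_ids]
    for layer in layers:  # each layer reads the states of the layer below
        xs = layer.run(xs)
    return matvec(W_out, xs[-1])


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Assumed sizes, chosen only to keep the example small.
N_PHONEMES, EMB_DIM, HID_DIM, IMG_DIM = 40, 8, 16, 32
rng = random.Random(0)
embeddings = rand_matrix(N_PHONEMES, EMB_DIM, rng)
layers = [GRULayer(EMB_DIM, HID_DIM, rng),
          GRULayer(HID_DIM, HID_DIM, rng),
          GRULayer(HID_DIM, HID_DIM, rng)]
W_out = rand_matrix(IMG_DIM, HID_DIM, rng)

phoneme_ids = [3, 17, 5, 22, 9]  # a made-up phoneme sequence
predicted = predict_visual_features(phoneme_ids, embeddings, layers, W_out)
target = [rng.uniform(0.0, 1.0) for _ in range(IMG_DIM)]  # stand-in image features
loss = cosine_distance(predicted, target)
```

In a setup like this, the per-layer states returned by `run` are what one would inspect to compare how strongly lower and higher layers encode form versus meaning.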
Similar resources
A study of viewpoints of English language instructors to motivate Lerner to learning English through curricular; representation of a Model
One problem students face when moving from secondary education to university is a lack of English language skills and of incentive to improve their learning. This research aims to identify ways to strengthen English language skills, with an emphasis on undergraduate students' motivation. It uses a qualitative approach and a grounded theory strategy. The study population has been consis...
Encoding of phonology in a recurrent neural model of grounded speech
We study the representation and encoding of phonemes in a recurrent neural network model of grounded speech. We use a model which processes images and their spoken descriptions, and projects the visual and auditory representations into the same semantic space. We perform a number of analyses on how information about individual phonemes is encoded in the MFCC features extracted from the speech s...
Representations of language in a model of visually grounded speech signal
We present a visually grounded model of speech perception which projects spoken utterances and images to a joint semantic space. We use a multi-layer recurrent highway network to model the temporal nature of spoken speech, and show that it learns to extract both form-based and meaning-based linguistic knowledge from the input signal. We carry out an in-depth analysis of the representations used by dif...
A Social Semiotic Analysis of Social Actors in English-Learning Software Applications
This study drew upon Kress and Van Leeuwen’s (2006, [1996]) visual grammar and Van Leeuwen’s (2008) social semiotic model to interrogate ways through which social actors of different races are visually and textually represented in four award-winning English-learning software packages. The analysis was based on narrative actional/reactional processes at the ideational level; mood, perspective, ...
Named Entity Recognition in Persian Text using Deep Learning
Named entity recognition is a fundamental task in the field of natural language processing. It is also known as a subset of information extraction. The process of recognizing named entities aims at finding proper nouns in the text and classifying them into predetermined classes such as names of people, organizations, and places. In this paper, we propose a named entity recognizer which benefi...